Starting off with the usual imports, these should be second nature by now!
# For dataframe and array manipulation
import pandas as pd
import numpy as np
# For visualization
import plotly
import plotly.express as px
We'll be starting with a basic example of how KNN works with an automatically generated dataset, so we'll also need the following imports.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Sklearn's make_blobs() function is used to generate clustered data based on a variety of parameters; this is perfect for looking at how certain algorithms work.
We will also need the KNeighborsClassifier model for our classification task, which is imported from sklearn.neighbors.
To start things off, we'll be creating a dataset with 400 total points clustered into two groups.
Using make_blobs(), we can also set the number of features each data point will have. We'll set the number of features to 2 so we can plot the features easily.
points, labels = make_blobs(n_samples=400, centers=2, n_features=2,
cluster_std=.5, center_box=(0,8), random_state=42)
points[:10]
array([[6.35909794, 4.50082194],
[5.91609935, 5.04648729],
[5.96146027, 4.74091132],
[3.25384479, 9.5320802 ],
[5.77694759, 4.57582734],
[5.8651685 , 4.96305873],
[5.05272837, 4.89099969],
[3.03422323, 7.2671336 ],
[3.36555424, 7.69139859],
[3.10336782, 6.98284506]])
labels[:10]
array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0])
The data already comes split into X and y arrays, which is perfect for the training step.
However, we may want to create a dataframe from these numpy arrays to be able to visualize what the data actually looks like.
data = pd.DataFrame(points, columns=["length","weight"])
data['target'] = labels.astype(str)
data.head(10)
|   | length | weight | target |
|---|---|---|---|
| 0 | 6.359098 | 4.500822 | 1 |
| 1 | 5.916099 | 5.046487 | 1 |
| 2 | 5.961460 | 4.740911 | 1 |
| 3 | 3.253845 | 9.532080 | 0 |
| 4 | 5.776948 | 4.575827 | 1 |
| 5 | 5.865169 | 4.963059 | 1 |
| 6 | 5.052728 | 4.891000 | 1 |
| 7 | 3.034223 | 7.267134 | 0 |
| 8 | 3.365554 | 7.691399 | 0 |
| 9 | 3.103368 | 6.982845 | 0 |
There are two species of fish: red fish and blue fish. A target of 0 represents a red fish and a target of 1 represents a blue fish. For each fish, we have its length and weight.
Now, let's create a scatter plot with each feature on a separate axis.
Each point will have a color corresponding to its target value, the value that we want to predict.
px.scatter(data, x='length', y='weight',color='target', color_discrete_map={'0': 'red', '1': 'blue'})
As can be seen in the plot, our data is nicely split into two distinct clusters. Because of this, we can make a pretty reasonable guess about the target value of a new point: red fish tend to be shorter and heavier, while blue fish are longer and lighter.
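Before reaching for a model, it's worth seeing how little machinery KNN actually needs: measure the distance from a new point to every known point, grab the K closest, and let them vote. A minimal numpy sketch of that voting step, recreating the same blobs and using a hypothetical new fish at (6, 6.5):

```python
import numpy as np
from sklearn.datasets import make_blobs

# Recreate the same two-cluster data as above
points, labels = make_blobs(n_samples=400, centers=2, n_features=2,
                            cluster_std=.5, center_box=(0, 8), random_state=42)

query = np.array([6, 6.5])                       # a hypothetical new fish
dists = np.linalg.norm(points - query, axis=1)   # Euclidean distance to every point
nearest = np.argsort(dists)[:10]                 # indices of the 10 closest points
votes = np.bincount(labels[nearest])             # tally the labels of those neighbors
print(votes.argmax())                            # majority vote -> 1 (blue fish)
```

This is exactly what `KNeighborsClassifier(n_neighbors=10)` computes under its default Euclidean metric and uniform weights, so it will agree with the model we train next.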
Let's train the model and see if it matches our intuition. Since we're only predicting a single point at a time, we'll train the model on the entire dataset.
X = data.drop(columns=["target"])
y = data["target"]
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X,y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=10, p=2,
weights='uniform')
Instead of using train_test_split, we're going to select our own test points to get a better sense of how the model behaves.
Let's predict three distinct points and see what the model decides to output.
y_pred1 = clf.predict([[6,6.5]])
y_pred2 = clf.predict([[2,6]])
y_pred3 = clf.predict([[4,6]])
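Because KNN predicts by vote, it can also report how lopsided the vote was via `predict_proba`: each probability is just the fraction of the K neighbors that carried that label. A quick sketch, refitting on the same generated points:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

points, labels = make_blobs(n_samples=400, centers=2, n_features=2,
                            cluster_std=.5, center_box=(0, 8), random_state=42)
clf = KNeighborsClassifier(n_neighbors=10).fit(points, labels)

# Each row: fraction of the 10 neighbors voting for class 0 and class 1
for point in [[6, 6.5], [2, 6], [4, 6]]:
    print(point, clf.predict_proba([point]))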
import matplotlib.pyplot as plt
LABEL_COLOR_MAP = {0:'r',1:'b'}
label_color = [LABEL_COLOR_MAP[l] for l in labels]
plt.scatter(x=data['length'],y=data['weight'],c=label_color)
plt.xlabel('length')
plt.ylabel('weight')
plt.scatter(x=6,y=6.5,marker="x",c="g",s=100)
plt.show()
print(y_pred1)
['1']
plt.scatter(x=data['length'],y=data['weight'],c=label_color)
plt.xlabel('length')
plt.ylabel('weight')
plt.scatter(x=2,y=6,marker="x",c="g",s=100)
plt.show()
print(y_pred2)
['0']
plt.scatter(x=data['length'],y=data['weight'],c=label_color)
plt.xlabel('length')
plt.ylabel('weight')
plt.scatter(x=4,y=6,marker="x",c="g",s=100)
plt.show()
print(y_pred3)
['0']
Success! Each of the predictions matched our intuition!
We tested the algorithm on a small, artificial dataset with only 2 features, but how well does it perform on real world data with potentially many features?
We will be using a breast cancer dataset to hopefully find an answer to this question.
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-8/workshop/breast_cancer.csv'
df = pd.read_csv(url)
df = df.drop(columns=['id','Unnamed: 32'])
df['diagnosis'] = df['diagnosis'].map({'M':1,'B':0})
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
By default, the KNeighborsClassifier uses a value of K=5.
predicted = clf.predict(X_test)
actual = np.array(y_test)
print('Look at first 10 predictions:')
print('Predicted: ',predicted[:10])
print('Actual: ',actual[:10])
Look at first 10 predictions:
Predicted:  [0 1 1 0 0 1 1 1 0 0]
Actual:  [0 1 1 0 0 1 1 1 0 0]
k5 = accuracy_score(predicted,actual)
print(k5)
0.956140350877193
Looks like the default classifier gave a pretty decent score. Keep in mind, though, that accuracy alone is not a great metric for judging model performance, and a score is only meaningful relative to a baseline such as always predicting the majority class.
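One caveat worth knowing before we tune K: KNN decisions rest entirely on distances, so a feature measured in hundreds (like area) can drown out one measured in tenths (like smoothness). A sketch of the effect using StandardScaler and sklearn's built-in copy of this same Wisconsin dataset (`load_breast_cancer`, which encodes malignant as 0 rather than 1; the accuracy comparison is unaffected):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Raw features: large-scale columns dominate the Euclidean distance
raw = KNeighborsClassifier().fit(X_train, y_train)
raw_acc = accuracy_score(y_test, raw.predict(X_test))

# Standardized features: every column contributes on the same scale
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier()).fit(X_train, y_train)
scaled_acc = accuracy_score(y_test, scaled.predict(X_test))

print(raw_acc, scaled_acc)
```

On distance-based models like KNN, standardizing the features typically helps; it's worth trying alongside the K tuning below.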
While the default classifier was decent, we can also adjust the number of neighbors that the model looks at to get potentially better results.
k_values = [1,5,10,50,200]
scores = []
clf1 = KNeighborsClassifier(n_neighbors=1)
clf1.fit(X_train, y_train)
predicted = clf1.predict(X_test)
actual = np.array(y_test)
k1 = accuracy_score(predicted,actual)
scores.append(k1)
scores.append(k5)
print(k1)
0.9298245614035088
clf10 = KNeighborsClassifier(n_neighbors=10)
clf10.fit(X_train, y_train)
predicted = clf10.predict(X_test)
actual = np.array(y_test)
k10 = accuracy_score(predicted,actual)
scores.append(k10)
print(k10)
0.9649122807017544
clf50 = KNeighborsClassifier(n_neighbors=50)
clf50.fit(X_train, y_train)
predicted = clf50.predict(X_test)
actual = np.array(y_test)
k50 = accuracy_score(predicted,actual)
scores.append(k50)
print(k50)
0.9473684210526315
clf200 = KNeighborsClassifier(n_neighbors=200)
clf200.fit(X_train, y_train)
predicted = clf200.predict(X_test)
actual = np.array(y_test)
k200 = accuracy_score(predicted,actual)
scores.append(k200)
print(k200)
0.9035087719298246
plt.plot(k_values,scores,'go--', linewidth=2, markersize=12)
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy of Model')
plt.title('K value versus Accuracy')
plt.show()
As we can see, adding more neighbors does not equate to higher model accuracy. Each dataset is unique, and we need to select a value for K based on the nuances in the data to get the best performance from our KNN model.
For this particular dataset, a K value of 10 (i.e., look at the 10 nearest data points) gave us the best results on the held-out data.
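The five nearly identical cells above can be collapsed into a loop. It's also worth noting that picking K by test-set accuracy quietly tunes on the test set; a more standard approach is cross-validation, sketched here with `cross_val_score` on sklearn's built-in copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

cv_scores = {}
for k in [1, 5, 10, 50, 200]:
    # Mean accuracy over 5 folds, so no single split decides the winner
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_scores[k] = scores.mean()
    print(k, round(cv_scores[k], 3))

best_k = max(cv_scores, key=cv_scores.get)
print('best K:', best_k)
```

The exact winner can differ from the single-split result above, which is precisely the point: averaging over folds gives a more reliable estimate of how each K generalizes.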